Overview:


Problem Formulation - Linear Regression

Objective: For a matrix of independent variables X and dependent variable(s) y, find a coefficient matrix B and bias (intercept) c such that X·B + c ≈ y


Assumptions:

  1. Linear functional form: The response variable y should be linearly related to the explanatory variables X.
  2. Little or no multicollinearity: The independent variables should not be too highly correlated with each other.
  3. Residual errors should be i.i.d.: After fitting the model on the training dataset, the residuals should be independent and identically distributed random variables.
  4. Residual errors should be normally distributed.
  5. Residual errors should be homoscedastic: The residuals should have constant variance.

Ref: https://towardsdatascience.com/assumptions-of-linear-regression-5d87c347140
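The residual assumptions (3–5) can be checked directly after a fit. A minimal numpy sketch on synthetic data (not the notebook's dataset): fit ordinary least squares, then confirm the residuals are centred on zero with roughly constant spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # homoscedastic i.i.d. noise

# fit ordinary least squares (with an intercept column) and inspect residuals
Xb = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
resid = y - Xb @ coef

# residuals should be centred on zero with roughly constant spread;
# in practice one would also plot them against the fitted values
print(resid.mean(), resid[:n // 2].std(), resid[n // 2:].std())
```

Plotting `resid` against the fitted values (and a Q-Q plot for normality) completes the picture.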


In [1]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from numpy.linalg import inv, pinv
from sklearn.datasets import make_regression

Linear Regression Solution using Linear Algebra


(a) Analytical Solution Using Linear Algebra:

In [2]:
# 1. Direct analytical solution via the normal equations: b = (X^T X)^-1 X^T y
X, y = make_regression(n_samples=10, n_features=5, n_targets=1)
#X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))  # uncomment to add an intercept column
b = inv(X.T.dot(X)).dot(X.T).dot(y)
print(b)
# predict using the coefficients
yhat = X.dot(b)
# plot predictions vs. actuals with an identity reference line
plt.scatter(yhat, y)
plt.plot(yhat, yhat, color='red')
plt.show()
[63.7682259  23.92716458 10.22698466 17.84119512 21.39222296]
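Forming `inv(X.T @ X)` explicitly is numerically fragile when features are correlated; `np.linalg.lstsq` solves the same least-squares problem more stably. A small self-contained check (noise-free synthetic data, so both routes agree with the true coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))
b_true = np.arange(1.0, 6.0)
y = X @ b_true  # noise-free, so the least-squares solution is exact

# normal equations (as in the cell above) vs. lstsq, which avoids
# explicitly forming and inverting X.T @ X
b_normal = np.linalg.inv(X.T @ X) @ X.T @ y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_lstsq)
```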

(b) Solution using Matrix Decomposition (SVD)


In [3]:
# 2. SVD Decomposition
X, y = make_regression(n_samples=10, n_features=5, n_targets=1)
#X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))
b = pinv(X).dot(y)
print(b)
# predict using coefficients
yhat = X.dot(b)
# plot data and predictions
plt.scatter(yhat, y)
plt.plot(yhat, yhat, color='red')
plt.show()
[36.1545294  11.75776356 20.05975628 43.73334549 43.45641295]
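`pinv` computes the Moore-Penrose pseudoinverse via the SVD; spelling that out makes the decomposition explicit (a sketch, not part of the original cell):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 5))
y = rng.normal(size=10)

# pinv(X) from the SVD: X = U @ diag(s) @ Vt, so X+ = V @ diag(1/s) @ U.T
# (reciprocals are taken only for nonzero singular values)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

b = X_pinv @ y  # identical to pinv(X).dot(y) in the cell above
```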

Non-linear Regression Solution using scipy's curve_fit

In [4]:
from scipy.optimize import curve_fit
from pandas import read_csv
from numpy import arange
 
# define the true objective function
def objective(x, a, b, c):
    return a * x + b * x**2 + c 
 
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# choose the input and output variables
x, y = data[:, 4], data[:, -1]
# curve fit
popt, _ = curve_fit(objective, x, y)
# summarize the parameter values
a, b, c = popt
print('y = %.5f * x + %.5f * x^2 + %.5f' % (a, b, c))
# plot input vs output
plt.scatter(x, y)
# define a sequence of inputs between the smallest and largest known inputs
x_line = arange(min(x), max(x), 1)
# calculate the output for the range
y_line = objective(x_line, a, b, c)
# create a line plot for the mapping function
plt.plot(x_line, y_line, '--', color='red')
plt.show()
y = 3.25444 * x + -0.01170 * x^2 + -155.02799
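To sanity-check `curve_fit`, it helps to fit data generated from known parameters and confirm they are recovered (a synthetic example, separate from the Longley data above):

```python
import numpy as np
from scipy.optimize import curve_fit

def objective(x, a, b, c):
    return a * x + b * x**2 + c

# synthetic data generated from known parameters a=3, b=-0.5, c=2
x = np.linspace(0, 10, 50)
rng = np.random.default_rng(3)
y = objective(x, 3.0, -0.5, 2.0) + rng.normal(scale=0.01, size=x.size)

popt, _ = curve_fit(objective, x, y)
print(popt)  # close to [3.0, -0.5, 2.0]
```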

Using sklearn for Linear Regression solution on randomly generated data

In [6]:
import pyforest
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X = np.random.randn(1000,5)
y = np.random.randn(1000,2)
X = pd.DataFrame(X)
X.columns = ['X1','X2','X3','X4','X5']
y = pd.DataFrame(y)
y.columns = ['y1','y2']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression(fit_intercept=True)  # note: `normalize` was removed from LinearRegression in recent scikit-learn; scale features separately if needed
model.fit(X_train,y_train)

y_train_pred = pd.DataFrame(model.predict(X_train))
y_train_pred.columns = ['y1_pred','y2_pred']
y_test_pred = pd.DataFrame(model.predict(X_test))
y_test_pred.columns = ['y1_pred','y2_pred']

print("R2 Scores = ", r2_score(y_test,y_test_pred,multioutput='raw_values'))
print("X Coefficients = " + str(model.coef_))
print("Bias = " + str(model.intercept_))
plt.scatter(y_test['y1'],y_test_pred['y1_pred'])
plt.plot(y_test['y1'],y_test['y1'], color='blue')
plt.xlabel('y_true',fontsize = 12)
plt.ylabel('y_pred',fontsize = 12)
R2 Scores =  [-0.01897842 -0.00296784]
X Coefficients = [[ 0.02804228 -0.01293924 -0.04024036  0.03687892  0.02040313]
 [-0.00136547  0.02601828 -0.03123583  0.02312271  0.0030394 ]]
Bias = [-0.03420393 -0.08097536]
Out[6]:
Text(0, 0.5, 'y_pred')

Since y was generated independently of X, there is no linear relationship to recover (the linearity assumption above is violated), so the fit is poor and the R2 scores are near zero!
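For contrast, when y really is a linear function of X (satisfying the linearity assumption), the same sklearn workflow recovers it with R2 close to 1. A quick sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
W = rng.normal(size=(5, 2))
y = X @ W + 0.01 * rng.normal(size=(1000, 2))  # y actually depends on X

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X), multioutput='raw_values')
print(r2)  # both values close to 1
```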

Steps in Machine Learning (Tabular Data):

Step 1: Import the data-set and libraries

Step 2: Identify numerical, categorical and datetime features

Step 3: Perform EDA - summary statistics, correlations, identification of null/missing values, univariate/bivariate analysis (chi2 analysis, point biserial correlation)

Step 4: Remove irrelevant/highly correlated features

Step 5: Train-Test split

Step 6: Missing value imputation, categorical feature encoding, scaling on train (fit_transform method) and apply the same to test (transform method)

Step 7: Oversampling/Undersampling and dimensionality reduction (if required)

Step 8: Outlier detection (and removal) on X_train, y_train

Step 9: Feature Engineering (Box-Cox, Yeo-Johnson transformation, LASSO, SISSO etc.)

Step 10: Set up models and evaluation metrics in an n-fold CV environment

Step 11: Fit the models and do hyperparameter tuning

Step 12: Use the best performing model to make predictions on unseen test dataset

Step 13: Add regularization/do further hyperparameter tuning to mitigate overfitting (if required)

Step 14: Use the best model with its optimal hyperparameters to fit on the entire dataset (train + test)

Step 15: At inference time, apply the same preprocessing (Steps 4, 6 and 9) to the incoming data and make live predictions using the model above
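Steps 5-6 (split first, then `fit_transform` on train / `transform` on test) are exactly what sklearn's Pipeline + ColumnTransformer automate. A minimal sketch with made-up column names (`num1`, `cat1`, `target` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# toy frame with one numeric and one categorical feature, each with a missing value
df = pd.DataFrame({'num1': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                   'cat1': ['a', 'b', 'a', np.nan, 'b', 'a'],
                   'target': [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]})

pre = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='mean')),
                      ('sc', StandardScaler())]), ['num1']),
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore'))]), ['cat1']),
])
model = Pipeline([('pre', pre), ('reg', LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    df[['num1', 'cat1']], df['target'], test_size=0.33, random_state=0)
model.fit(X_train, y_train)   # fit_transform runs on the train fold only
pred = model.predict(X_test)  # the fitted transforms are applied to test
```

Wrapping preprocessing in the pipeline also prevents train/test leakage during cross-validation.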


AdaBoost:

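A minimal AdaBoostRegressor sketch on made-up data (sklearn's default base learner is a depth-3 decision tree): each boosting round reweights the sample towards the currently worst-predicted points.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# each boosting round fits a weak tree (depth-3 by default) to a
# reweighted sample that emphasises the currently worst-predicted points
reg = AdaBoostRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)
print(reg.score(X, y))  # training R-squared, high on this smooth target
```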

XGBoost:


(g) Gaussian Naive Bayes

https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

Main Hyperparameters: essentially none (`var_smoothing` is the only knob, and the default usually works)
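A minimal GaussianNB sketch on two synthetic, well-separated blobs (made-up data, just to show the API):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# two well-separated synthetic blobs
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# GaussianNB fits a per-class, per-feature Gaussian; `var_smoothing`
# is its only real hyperparameter and rarely needs tuning
clf = GaussianNB().fit(X, y)
print(clf.score(X, y))
```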


Regression Problem: Restaurant Revenue prediction (Linear and Non-Linear Regression)

In [17]:
import dtale
data = pd.read_csv(r'C:\Users\suryanaman.c\Desktop\restaurant-revenue-prediction\train.csv')
d = dtale.show(data)
#d.open_browser()
In [15]:
import pyforest
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import ADASYN 
import warnings
warnings.filterwarnings('ignore')

#config parameters
path = r'C:\Users\suryanaman.c\Desktop\restaurant-revenue-prediction\train.csv'
categorical_features = ['City', 'City Group', 'Type']
numerical_features = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11',
                       'P12', 'P13', 'P14', 'P15', 'P16', 'P17', 'P18', 'P19', 'P20', 'P21',
                       'P22', 'P23', 'P24', 'P25', 'P26', 'P27', 'P28', 'P29', 'P30', 'P31',
                       'P32', 'P33', 'P34', 'P35', 'P36', 'P37']
date_feature = ['Open Date']

#function definitions
def data_fetch(path):
    file_path = path
    file = pd.read_csv(path)
    return file

def data_prep(data):     
    
    thresh = len(data)*0.9
    data = data.dropna(thresh = thresh, axis = 1)
    
    # Read about correlations between features (Pearson for numerical-numerical, Point Biserial for Numerical-binary and Chi2 for Categorical)
    # Link: https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
    # Link: https://towardsdatascience.com/point-biserial-correlation-with-python-f7cd591bd3b1#:~:text=Linear%20regression%20is%20a%20classic,have%20an%20almost%20linear%20relationship.&text=That%20is%20where%20point%20biserial%20correlation%20comes%20to%20our%20aid.
    
    def imputation_encoding_and_train_test_split(data):
        X = data[numerical_features+categorical_features+date_feature]
        X[date_feature] = X[date_feature].astype('datetime64[ns]')
        X['delta_time'] = pd.to_datetime("now") - X[date_feature]
        X['delta_time'] = X['delta_time'].dt.days
        X = X.drop(columns = date_feature)
        date_feature_new = ['delta_time']
        y = data['revenue']

        # Train Test Split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        
        #X_train[numerical_features] = X_train[numerical_features].apply(pd.to_numeric, errors = 'coerce')
        imputer_num = SimpleImputer(missing_values = np.nan, strategy = 'mean')
        scaler = QuantileTransformer(n_quantiles = 10)
        
        # Imputation and quantile scaling for numerical features
        X_train[numerical_features+date_feature_new] = imputer_num.fit_transform(X_train[numerical_features+date_feature_new])
        X_train[numerical_features+date_feature_new] = scaler.fit_transform(X_train[numerical_features+date_feature_new])
        X_test[numerical_features+date_feature_new] = imputer_num.transform(X_test[numerical_features+date_feature_new])
        X_test[numerical_features+date_feature_new] = scaler.transform(X_test[numerical_features+date_feature_new])
        
        # Mode Imputation and Dummy Encoding for categorical features
        for col in categorical_features:
            mode_category = X_train[col].mode()[0]
            X_train[col] = X_train[col].fillna(mode_category)
            X_test[col] = X_test[col].fillna(mode_category)

        dummies_train = pd.get_dummies(X_train[categorical_features])
        X_train = pd.concat([X_train[numerical_features+date_feature_new], dummies_train], axis = 1)
        dummies_test = pd.get_dummies(X_test[categorical_features])
        X_test = pd.concat([X_test[numerical_features+date_feature_new], dummies_test], axis = 1)

        missing_cols = set(X_train.columns) - set(X_test.columns)

        # Add a missing column in test set with default value equal to 0
        for c in missing_cols:
            X_test[c] = 0

        # Ensure the order of columns in the test set is the same as in the train set
        X_test = X_test[X_train.columns]
        return X_train, y_train, X_test, y_test
        
    def outlier_removal(X_train, y_train):
        from sklearn.ensemble import IsolationForest
        iso = IsolationForest(contamination=0.01)
        # Link - https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
        X_tr_np = X_train.to_numpy()
        y_tr_np = y_train.to_numpy()
        yhat = iso.fit_predict(X_tr_np)
        # select all rows that are not outliers
        mask = yhat != -1
        X_tr_np, y_tr_np = X_tr_np[mask, :], y_tr_np[mask]
        X_train_new = pd.DataFrame(X_tr_np, columns = X_train.columns)
        y_train_new = y_tr_np
        return X_train_new, y_train_new
    
    def feature_engineering(X_train, y_train, X_test, y_test):
        from autofeat import FeatureSelector, AutoFeatRegressor
        afreg = AutoFeatRegressor(verbose=1, feateng_steps = 2, featsel_runs = 1)
        #Link - https://arxiv.org/pdf/1901.07329.pdf
        X_train_af = afreg.fit_transform(X_train, y_train)
        X_test_af = afreg.transform(X_test)
        new = set(X_train_af.columns) - set(X_train.columns)
        print("New Features after feature engineering: "+str(new))
        return X_train_af, X_test_af
    
    X_train, y_train, X_test, y_test = imputation_encoding_and_train_test_split(data)
    #X_train, y_train = outlier_removal(X_train, y_train)
    #X_train, X_test = feature_engineering(X_train, y_train, X_test, y_test)
    return X_train, y_train, X_test, y_test

def model_train(X_train, y_train, X_test, y_test):
    # Linear models
    # note: fit_intercept takes booleans, not strings ('False' is truthy and would silently mean True)
    reg1 = GridSearchCV(LinearRegression(), {'fit_intercept': [True, False]}, cv=5, scoring = 'r2')
    reg2 = GridSearchCV(Ridge(), {'alpha':[0.1], 'fit_intercept': [True, False]}, cv=5, scoring = 'r2')
    reg3 = GridSearchCV(Lasso(), {'alpha':[0.1], 'fit_intercept': [True, False]}, cv=5, scoring = 'r2')
    reg4 = GridSearchCV(ElasticNet(), {'alpha':[0.01], 'l1_ratio':[0.5], 'fit_intercept': [True, False]}, cv=5, scoring = 'r2')
    reg5 = GridSearchCV(SGDRegressor(), {'alpha':[0.0001,0.01,0.1], 'max_iter': [10000]}, cv=5, scoring = 'r2')
    
    # Ensemble models
    reg6 = GridSearchCV(DecisionTreeRegressor(), {'max_depth':[25],'random_state':[0]}, cv=5, scoring = 'r2')
    reg7 = GridSearchCV(RandomForestRegressor(), {'max_depth':[20], 'n_estimators':[100], 'bootstrap': [True], 'min_samples_leaf': [1], 'max_features':['auto'], 'criterion':['mse'], 'random_state':[0]}, cv=5, scoring = 'r2')
    reg8 = GridSearchCV(AdaBoostRegressor(), {'base_estimator': [RandomForestRegressor()], 'n_estimators':[100], 'learning_rate':[1], 'loss':['exponential'], 'random_state':[0]}, cv=5, scoring = 'r2')
    reg9 = GridSearchCV(XGBRegressor(), {'max_depth':[20], 'n_estimators':[100], 'random_state':[0]}, cv=5, scoring = 'r2')
    
    # Kernel based models
    reg10 = GridSearchCV(SVR(), {'kernel':['rbf'], 'C':[1], 'gamma':[0.001]}, cv=5, scoring = 'r2')
    reg11 = GridSearchCV(KernelRidge(), {'kernel':['laplacian'], 'alpha':[0.1], 'gamma':[0.05]}, cv=5, scoring = 'r2')
    
    # Other models
    reg12 = GridSearchCV(KNeighborsRegressor(), {'n_neighbors':[3], 'weights':['distance'], 'algorithm':['auto'], 'leaf_size': [1,2,3], 'p':[2]}, cv=5, scoring = 'r2')
    reg13 = GridSearchCV(MLPRegressor(), {'hidden_layer_sizes':[(5,)], 'max_iter':[10000]}, cv=5, scoring = 'r2')
    #----------------------------------------------------------------------------------------------------------------
    
    # Define regressor, fit and evaluate the model
    reg = reg7
    reg.fit(X_train,y_train)
    score_train = np.mean(np.abs((y_train - reg.predict(X_train)) / y_train)) * 100
    score_test = np.mean(np.abs((y_test - reg.predict(X_test)) / y_test)) * 100
    score_train_rmse = np.sqrt(((reg.predict(X_train) - y_train) ** 2).mean())
    score_test_rmse = np.sqrt(((reg.predict(X_test) - y_test) ** 2).mean())
    y_train_pred = reg.predict(X_train)
    y_test_pred = reg.predict(X_test)
    print("Regression Model: \n")
    print("MAPE on the training dataset is: " + str(score_train))
    print("MAPE on the test dataset is: " + str(score_test))
    print("\nRMSE on the training dataset is: " + str(score_train_rmse))
    print("RMSE on the test dataset is: " + str(score_test_rmse))
    return score_train, score_test, y_train, y_train_pred, y_test, y_test_pred

def visualize_results(y_train, y_train_pred, y_test, y_test_pred):
    fig = plt.figure()
    fig.set_figheight(10)
    fig.set_figwidth(6)

    ax1 = fig.add_subplot(211)
    ax1.scatter(y_train, y_train_pred)
    ax1.plot(y_train, y_train, color='red')
    ax1.set_xlim([0, 10000000])
    ax1.set_ylim([0, 10000000])
    ax1.set_xlabel('Actual Revenue',fontsize = 12)
    ax1.set_ylabel('Predicted Revenue',fontsize = 12)
    #ax1.set_aspect('equal')

    ax2 = fig.add_subplot(212)
    ax2.scatter(y_test, y_test_pred)
    ax2.plot(y_test, y_test, color='red')
    ax2.set_xlim([0, 10000000])
    ax2.set_ylim([0, 10000000])
    ax2.set_xlabel('Actual Revenue',fontsize = 12)
    ax2.set_ylabel('Predicted Revenue',fontsize = 12)
    #ax2.set_aspect('equal')

    plt.show()

# Combine all the operations and display
if __name__ == '__main__':
    data = data_fetch(path)
    X_train, y_train, X_test, y_test = data_prep(data)
    score_train, score_test, y_train, y_train_pred, y_test, y_test_pred = model_train(X_train, y_train, X_test, y_test)
    visualize_results(y_train, y_train_pred, y_test, y_test_pred)
Regression Model: 

MAPE on the training dataset is: 15.968721013396944
MAPE on the test dataset is: 25.403170470421244

RMSE on the training dataset is: 907199.1569848048
RMSE on the test dataset is: 2932616.630641923
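The add-missing-columns loop used above can also be written with `DataFrame.reindex`, which handles both absent and unseen dummy columns in one call. A sketch with a toy `city` column:

```python
import pandas as pd

train = pd.DataFrame({'city': ['A', 'B', 'A']})
test = pd.DataFrame({'city': ['B', 'C']})  # 'C' unseen in train, 'A' absent in test

d_train = pd.get_dummies(train['city'])
# reindex adds the absent train columns (filled with 0) and drops
# categories never seen during training, in one call
d_test = pd.get_dummies(test['city']).reindex(columns=d_train.columns, fill_value=0)
print(list(d_test.columns))
```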

Classification Problem - Promotion Dataset


In [16]:
import pyforest
import dtale
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, plot_confusion_matrix, plot_precision_recall_curve
from sklearn.svm import SVC
from imblearn.over_sampling import ADASYN 
import warnings
warnings.filterwarnings('ignore')

#config parameters
path = r'C:\Users\suryanaman.c\Desktop\train.csv'
numerical_features = ['no_of_trainings', 'age', 'length_of_service', 'awards_won?', 'avg_training_score']
categorical_features = ['department', 'region', 'education', 'gender', 'recruitment_channel', 'previous_year_rating', 'KPIs_met >80%']

def data_fetch(path):
    file_path = path
    file = pd.read_csv(path)
    return file

def data_prep(data):     
    X = data[numerical_features+categorical_features]
    y = data['is_promoted']
    print("Original Class distribution is: "+str(y.value_counts()[0])+":"+str(y.value_counts()[1]))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify = y)

    imputer_num = SimpleImputer(missing_values = np.nan, strategy = 'mean')
    scaler = MinMaxScaler()

    X_train[numerical_features] = imputer_num.fit_transform(X_train[numerical_features])
    X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])

    for col in categorical_features:
        mode_category = X_train[col].mode()[0]
        X_train[col] = X_train[col].fillna(mode_category)
        X_test[col] = X_test[col].fillna(mode_category)

    dummies_train = pd.get_dummies(X_train[categorical_features])
    X_train = pd.concat([X_train[numerical_features], dummies_train], axis = 1)

    X_test[numerical_features] = imputer_num.transform(X_test[numerical_features])
    X_test[numerical_features] = scaler.transform(X_test[numerical_features])

    dummies_test = pd.get_dummies(X_test[categorical_features])
    X_test = pd.concat([X_test[numerical_features], dummies_test], axis = 1)
    
    missing_cols = set(X_train.columns) - set(X_test.columns)
    
    # Add a missing column in test set with default value equal to 0
    for c in missing_cols:
        X_test[c] = 0
    
    # Ensure the order of columns in the test set is the same as in the train set
    X_test = X_test[X_train.columns]
    
    def minority_oversampling(X_train, y_train):
        # Minority oversampling using ADASYN (try SMOTE vs ADASYN)
        oversampling = ADASYN(sampling_strategy = 0.9, random_state = 0, n_neighbors = 10)
        X_train_oversample, y_train_oversample = oversampling.fit_resample(X_train, y_train)
        return X_train_oversample, y_train_oversample
    
    X_train, y_train = minority_oversampling(X_train, y_train)
    print("Class distribution after oversampling is: "+str(y_train.value_counts()[0])+":"+str(y_train.value_counts()[1]))
    return X_train, y_train, X_test, y_test

def model_train(X_train, y_train, X_test, y_test):
    clf1 = GridSearchCV(RandomForestClassifier(random_state = 0, class_weight = 'balanced'), {'max_depth':[20],'n_estimators':[50],'criterion':['gini'],'random_state':[0]},cv=5)
    clf2 = GridSearchCV(SVC(class_weight = 'balanced'), {'kernel':['rbf'],'C':[10],'gamma':[0.05]},cv=5)
    
    clf1.fit(X_train,y_train)
    score_train = f1_score(y_train, clf1.predict(X_train))
    score_test = f1_score(y_test, clf1.predict(X_test))
    print("Classification Model: \n")
    print("F1-Score on the training dataset with 5-fold CV is: " + str(score_train))
    print("\nF1-Score on the test dataset is: " + str(score_test))
    print(confusion_matrix(y_test, clf1.predict(X_test)))
    plot_precision_recall_curve(clf1, X_test, y_test)
    
    return score_test

if __name__ == '__main__':
    data = data_fetch(path)
    X_train, y_train, X_test, y_test = data_prep(data)
    score_test = model_train(X_train, y_train, X_test, y_test)
Original Class distribution is: 50140:4668
Class distribution after oversampling is: 40112:34675
Classification Model: 

F1-Score on the training dataset with 5-fold CV is: 0.9275310443569481

F1-Score on the test dataset is: 0.42457337883959045
[[8654 1374]
 [ 312  622]]
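Rows of the confusion matrix above are true classes and columns are predicted classes; the reported test F1 can be reproduced directly from it:

```python
# confusion matrix reported above: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = 8654, 1374, 312, 622

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # f1 reproduces the reported ~0.4246
```

The gap between a high training F1 and this test F1 reflects overfitting to the oversampled training distribution.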

